ABSTRACT


One of the major challenges in building a task-oriented dialogue system is that dialogue state transitions
frequently happen between multiple domains, such as booking hotels or restaurants. Recently, end-to-end
neural encoder-decoder models have become an attractive approach to this challenge. However, such models
usually require a large amount of training data and are not flexible in handling dialogue state
transitions. This paper addresses these problems by proposing a simple but practical framework called
Multi-Domain KB-BOT (MDKB-BOT), which leverages both neural networks and rule-based strategies in natural
language understanding (NLU) and dialogue management (DM). Experiments on the data set of the Chinese
Human-Computer Dialogue Technology Evaluation Campaign show that MDKB-BOT achieves competitive
performance on several evaluation metrics, including task completion rate and user satisfaction.


1. INTRODUCTION 


In the past decade, dialogue systems have become an attractive research topic. They can be classified into
open-domain dialogue systems and task-oriented dialogue systems. One general approach to dialogue
system design is to treat it as a retrieval problem by learning a relevance matching score between user
queries and system responses. Inspired by recent advances in deep learning, building an end-to-end
dialogue system has become a popular approach for its flexibility and extensibility. For example,
encoder-decoder models based on recurrent neural networks (RNNs) directly maximize the likelihood of the
desired responses given the previous dialogue history. However, two major drawbacks of those
systems are that multiple training corpora are required and that generic responses such as “I do not know”
are likely to be generated. These drawbacks limit the generalization ability, especially for a
task-oriented system in which knowledge from multiple domains is needed to understand users’ underlying
intents.


Compared with the end-to-end approach, designing a task-oriented dialogue system as a modularized pipeline
is feasible. Each essential component is trained individually: 1) Natural Language Understanding (NLU),
which identifies the task domain and user intent and extracts slot-value pairs; 2) the Dialogue
Manager (DM), which tracks the dialogue state and guides users toward the desired goal; and 3) Natural
Language Generation (NLG), which generates responses. One of the challenges for a task-oriented dialogue
system is that dialogue state transitions frequently happen between multiple domains. If earlier components
make mistakes in slot value extraction and the errors accumulate, the functionality of the entire system
will be severely impaired.


To address the complex dialogue state transition problem, we adopt a modularized pipeline architecture
and propose a multi-domain KB-BOT (MDKB-BOT), which leverages both rule extraction and
neural networks. We run evaluation experiments on the data set of the Chinese Human-Computer
Dialogue Technology Evaluation Campaign. The experimental results show that MDKB-BOT robustly
handles frequent changes of user intent among three domains (flight, train and hotel) and achieves
competitive scores on human evaluation metrics.


2. RELATED WORK 


As mentioned above, there has been a great deal of research on applying deep learning to task-oriented
dialogue systems. One of the most effective approaches is to build a modularized pipeline system by
connecting NLU, DM and NLG. The traditional approach to NLU is to model domain classification and
intent detection as sentence classification while treating slot-value pair extraction as a sequence
labeling task. A desirable NLU system should be robust to intent errors and slot errors, especially in
slot filling. For example, Xu and Sarikaya [1] applied an RNN to perform contextual domain classification
and used a triangular conditional random field (CRF) based on a convolutional neural network for intent
detection and slot filling. Jaech, Heck and Ostendorf [2] applied multi-task learning to leverage
knowledge from source domains with rich data to improve performance on a target domain with little
data. Bapna et al. [3] explored the role of context information in NLU by injecting previous dialogue
turns into an RNN-based encoder and a memory network.


On the other hand, many attempts have been made to improve the architecture of DM. Recent research
indicates that reinforcement learning (RL) holds promise for planning a dialogue policy based on the current
dialogue state. Williams, Asadi and Zweig [4] proposed a model called Hybrid Code Networks (HCNs),
which is a mixture of supervised learning and reinforcement learning. HCNs select a dialogue action at every
step by optimizing the reward for completing a task with policy gradient [5]. Faced with the sparse nature
of the reward signal in RL, Peng et al. [6] designed an end-to-end framework for hierarchical RL in which a
MANAGER chooses the current goal (such as a specific domain task) and a WORKER takes
actions and helps users finish the current subtask. Inspired by recent advances in RL, Mrkšić et al. [7]
introduced a belief tracker that overcomes the drawback of requiring a large number of hand-crafted
lexicons to capture some of the linguistic variation in users’ language. Their Neural Belief Tracking (NBT)
models can reason over pre-trained word embeddings of the system output, the user utterance and candidate
slot-value pairs in databases.


As for NLG, most current work either applies information retrieval techniques to a large query-response
database, or uses template-based methods with a set of rules to map frames to natural language, or uses
generation models. Dušek and Jurčíček [8] encoded frames based on the syntax tree and used a seq2seq
model for generation.


3. PROPOSED FRAMEWORK 


The proposed framework, which consists of NLU, DM and NLG, is illustrated in Figure 1. The implementation
of these components is described in Sections 3.1 to 3.3.


Figure 1. The overall framework of the model consists of three components: (1) the NLU module, which
predicts the intent domain and extracts the slot values of the user utterance, (2) the DM module, which
outputs the dialogue action to the NLG module, and (3) the NLG module, which generates the final response.


3.1 Natural Language Understanding (NLU)


The main tasks for NLU involve domain classification, intent detection and slot filling as illustrated in 
Figure 1. 


3.1.1 Domain Classification 


A convolutional neural network (CNN) proposed by Kim [9] is adopted for domain classification. Let
$W \in \mathbb{R}^{v \times d}$ be the $d$-dimensional word embedding table, where $v$ is the vocabulary
size. The semantic representation of a user query, $X \in \mathbb{R}^{n \times d}$, is obtained by looking
up each word in $W$, where $n$ is the number of words in the query. A 1-D convolutional layer is then
applied to $X$ to extract n-gram features. However, the CNN may misclassify queries that contain
descriptions of several domains, for example, “The train is cheaper, but to save time, give me airline
flight schedules and flight timetables.” Thus, for our online model, rule strategies are used to deal with
this misclassification problem by constructing a keyword list from both corpora and databases, e.g., a
city name list.
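The paper does not specify how the keyword rules are combined with the CNN output. The sketch below shows one plausible arrangement, in which keyword counts override the CNN label only when the CNN is not confident. The keyword lists, the `threshold` value and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword lists; in practice they would be built from the
# corpora and databases (e.g. a city name list), as the text describes.
DOMAIN_KEYWORDS = {
    "train": {"train", "railway", "carriage"},
    "flight": {"flight", "airline", "airport"},
    "hotel": {"hotel", "room", "check-in"},
}


def keyword_votes(query):
    """Count keyword hits per domain in a whitespace-tokenized query."""
    tokens = [tok.strip(".,!?").lower() for tok in query.split()]
    votes = Counter()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        votes[domain] = sum(tok in keywords for tok in tokens)
    return votes


def classify_domain(query, cnn_domain, cnn_confidence, threshold=0.8):
    """Keep the CNN label unless it is uncertain and keywords were found.

    cnn_domain and cnn_confidence stand in for the classifier output;
    the threshold is an assumed tuning parameter.
    """
    best_domain, best_count = keyword_votes(query).most_common(1)[0]
    if best_count == 0 or cnn_confidence >= threshold:
        return cnn_domain
    return best_domain
```

For the example query in the text, the word "train" appears once while "airline" and "flight" together appear three times, so a low-confidence "train" prediction would be overridden by "flight" under these assumed lists.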


3.1.2 Slot Filling 


Slot filling is treated as a named entity recognition task in which the popular begin-inside-outside (BIO)
format is used to represent the tag of each word in a query. A Long Short-Term Memory (LSTM) network then
scans the words and outputs the representation:


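$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \tag{1}$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \tag{2}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \tag{3}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{5}$$
$$h_t = o_t \odot \tanh(c_t). \tag{6}$$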
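To make Equations (1) to (6) concrete, the following is a minimal pure-Python sketch of a single LSTM step. The list-based matrix layout, the `params` dictionary and the function names are assumptions for illustration; a real system would use a deep learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bias
        for W_row, U_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps "f", "i", "o", "c" to (W, U, b) triples."""

    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x_t, U, h_prev, b)]

    f_t = gate("f", sigmoid)          # Eq. (1): forget gate
    i_t = gate("i", sigmoid)          # Eq. (2): input gate
    o_t = gate("o", sigmoid)          # Eq. (3): output gate
    c_tilde = gate("c", math.tanh)    # Eq. (4): candidate cell state
    # Eq. (5): element-wise mix of the old cell state and the candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Eq. (6): hidden state
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate state to 0, so the step simply halves the previous cell state, which is a quick sanity check on the wiring.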


To enhance the ability to extract slot-value pairs, a CRF network is connected to the output of the LSTM or
bidirectional LSTM (BLSTM). The score of a sentence $X$ along with a path of tags $Y$ is then calculated
as the sum of the transition scores $A$ and the LSTM network scores $f$:


$$S(X, Y, \theta) = \sum_{t=1}^{T} \left( A_{y_{t-1}, y_t} + f_\theta(t, y_t) \right), \tag{7}$$


where $\theta$ denotes the trainable parameters of the LSTM network, $A_{y_{t-1}, y_t}$ is the transition
score from tag $y_{t-1}$ to tag $y_t$, and $f_\theta(t, y_t)$ is the network score of tag $y_t$ at
position $t$.


Figure 2 shows a bidirectional LSTM network enhanced with a CRF network on the top layer. For our
online model, we apply both keyword matching and BLSTM-CRF to cope with diverse or nonstandard
expressions.
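The path score in Equation (7) and the search for the best tag path can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: scores are stored in nested dictionaries, `start_tag` is a hypothetical start symbol standing in for the tag before the first position, and decoding uses the standard Viterbi algorithm, which the paper does not name explicitly.

```python
def path_score(emissions, transitions, tags, start_tag):
    """Sum transition and network scores along one tag path (Eq. 7).

    emissions[t][y] is the network score of tag y at position t;
    transitions[a][b] is the score of moving from tag a to tag b.
    """
    score = 0.0
    prev = start_tag
    for t, tag in enumerate(tags):
        score += transitions[prev][tag] + emissions[t][tag]
        prev = tag
    return score


def viterbi(emissions, transitions, tag_set, start_tag):
    """Return the highest-scoring tag path and its score."""
    # best[y] = (score of the best path ending in tag y, that path)
    best = {
        y: (transitions[start_tag][y] + emissions[0][y], [y]) for y in tag_set
    }
    for t in range(1, len(emissions)):
        new_best = {}
        for y in tag_set:
            prev = max(tag_set, key=lambda p: best[p][0] + transitions[p][y])
            score = best[prev][0] + transitions[prev][y] + emissions[t][y]
            new_best[y] = (score, best[prev][1] + [y])
        best = new_best
    final = max(tag_set, key=lambda y: best[y][0])
    return best[final][1], best[final][0]
```

The transition scores let the CRF penalize tag sequences that are unlikely as a whole, which a per-token classifier on the LSTM outputs cannot do.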


[Figure 2 diagram: the input tokens (English gloss: “Tomorrow Beijing to Shanghai Flight”) feed the BLSTM
layers, and a CRF layer on top outputs the tags B-date, B-departCity, O, B-arriveCity, O, O.]


Figure 2. BLSTM-CRF model for slot filling.
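Once the tagger has produced BIO tags, they still have to be converted into slot-value pairs. The sketch below shows one common way to do this, using the English gloss from Figure 2 as input. The function name is hypothetical, and joining tokens with a space is an assumption that suits the English gloss (Chinese tokens would normally be joined without a separator).

```python
def bio_to_slots(tokens, tags):
    """Collect (slot, value) pairs from BIO tags such as B-date, I-date, O.

    An I- tag that does not continue an open span of the same slot is
    treated like O and dropped.
    """
    slots = []
    current_slot, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_slot == tag[2:]:
            current_tokens.append(token)  # continue the open span
            continue
        if current_slot is not None:
            # Any other tag closes the span that was open so far.
            slots.append((current_slot, " ".join(current_tokens)))
            current_slot, current_tokens = None, []
        if tag.startswith("B-"):
            current_slot, current_tokens = tag[2:], [token]
    if current_slot is not None:
        slots.append((current_slot, " ".join(current_tokens)))
    return slots


tokens = ["Tomorrow", "Beijing", "to", "Shanghai", "Flight"]
tags = ["B-date", "B-departCity", "O", "B-arriveCity", "O"]
slots = bio_to_slots(tokens, tags)
```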


3.1.3 Intent Detection 


Based on slots extracted from BLSTM-CRF, we update the maintained dialogue state template. Then user 
intent is inferred by comparing the predefined dialogue template with the new state template. 


3.2 Dialogue Management (DM) 


The NLU module outputs the user intent domain and the slot values of the current turn.


To avoid spending unnecessary dialogue turns on information that is insignificant to many users, we
divide the slots into two categories: required slots and extra slots. Required slots, such as
<departCity>, are necessary for the task. Extra slots, such as <trainValue> and <countRate>, would make
the dialogue tedious for the many users who do not care about them. We therefore only ask for the
required slots necessary for booking, but if a user mentions extra slots in the dialogue, we take them
into account when retrieving information from our database.
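The split between required and extra slots can be sketched as two small helpers: one decides what the system still has to ask for, the other decides which constraints go into the database query. The slot schema below is hypothetical; only <departCity> and <trainValue> come from the text, and their assignment to the train domain is an assumption.

```python
# Hypothetical schema for a single domain, for illustration only.
REQUIRED_SLOTS = {"train": ("departCity", "arriveCity", "date")}
EXTRA_SLOTS = {"train": ("trainValue",)}


def missing_required(domain, state):
    """Required slots that the system still needs to ask the user for."""
    return [slot for slot in REQUIRED_SLOTS[domain] if slot not in state]


def query_constraints(domain, state):
    """Every filled slot of the domain, required or extra, is used in retrieval."""
    known = set(REQUIRED_SLOTS[domain]) | set(EXTRA_SLOTS[domain])
    return {slot: value for slot, value in state.items() if slot in known}


# The user volunteered an extra slot, so it is kept as a constraint,
# but the system only asks for the two required slots still missing.
state = {"departCity": "Beijing", "trainValue": "some-value"}
to_ask = missing_required("train", state)
constraints = query_constraints("train", state)
```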


To regulate the course of the dialogue when our system interacts with users, the DM module is applied to
update the conversation state and choose the next dialogue action. We divide this module into three
states, described as follows.


• Initial state: At the beginning of the conversation, an utterance with no explicit intention is
considered purposeless talk. After identifying the user’s intent, the system moves to the slot filling
state. Note that the system stores slot information even before domain prediction, and the
information is assigned to the corresponding slots afterward.

• Slot filling state: The main task in this state is to interact with the user to obtain the
required slot information for generating responses.

• Recommendation state: Our bot lists the results that meet the user’s demands by retrieving and
extracting them from the database. In case of failure, we apply a series of strategies for similar
recommendations, such as: (1) remove the limitations of extra slots, (2) make appropriate adjustments
to the departure time, (3) change the cabin or train type, and (4) increase the price range. In addition,
users can change their requests and return to the slot filling state or recommendation state again.
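The three states above can be read as a small state machine. The sketch below is one possible encoding of the transitions described in the list; the `State` enum, the `next_state` function and its arguments are hypothetical names, and the fallback recommendation strategies are left out.

```python
from enum import Enum


class State(Enum):
    INITIAL = "initial"
    SLOT_FILLING = "slot_filling"
    RECOMMENDATION = "recommendation"


def next_state(state, intent, missing_required):
    """Choose the next dialogue state.

    intent is None when no explicit intention has been identified;
    missing_required lists the required slots that are still empty.
    """
    if state is State.INITIAL:
        # Purposeless talk keeps the dialogue in the initial state.
        return State.SLOT_FILLING if intent is not None else State.INITIAL
    # From slot filling or recommendation: keep asking while required
    # slots are missing (also after the user changes a request),
    # otherwise move to or stay in recommendation.
    return State.SLOT_FILLING if missing_required else State.RECOMMENDATION
```

Keeping the transition logic in one pure function makes it easy to test each rule separately from NLU and database access.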


Throughout the dialogue, when the system finds that an intent cannot be completed, it promptly informs
the user to avoid wasting time. For example, when a user wants to book a flight to a city where no air
service is available, it is unwise to continue the dialogue, and the system recommends other means of
travel. A user can also change his or her intent at any time, and our system automatically carries over
the slot information common to both intents during the transition.
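Carrying over common slots when the intent changes can be sketched as a simple filter on the dialogue state. The per-domain slot sets below are hypothetical, as is the function name; the paper does not list the slots shared between domains.

```python
# Hypothetical per-domain slot sets, for illustration only.
DOMAIN_SLOTS = {
    "flight": {"departCity", "arriveCity", "date", "cabin"},
    "train": {"departCity", "arriveCity", "date", "trainType"},
}


def switch_domain(state, new_domain):
    """Keep only the filled slots that the new domain also defines."""
    return {
        slot: value
        for slot, value in state.items()
        if slot in DOMAIN_SLOTS[new_domain]
    }


flight_state = {"departCity": "Beijing", "arriveCity": "Shanghai", "cabin": "economy"}
train_state = switch_domain(flight_state, "train")
```

Here the cities survive the switch from flight to train, while the flight-specific cabin slot is dropped, so the user does not have to repeat information already given.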


3.3 Natural Language Generation (NLG) 


So far, we have obtained both the category of the user query’s intention and the next dialogue action for
each turn, which guide the NLG module to generate natural language text and reply to the user’s query.
Given the user’s slot list, we convert it into an SQL statement, query the database that stores
information on trains, flights and accommodations to check whether any items satisfy the user’s goal,
and match the appropriate template for the reply.
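The conversion from a slot list to an SQL statement can be sketched with Python's built-in `sqlite3` module. The table name, column names and sample rows are invented for illustration, since the paper does not describe the database schema or engine. Slot values are passed as query parameters rather than pasted into the SQL string.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
ALLOWED = {"flight": {"departCity", "arriveCity", "departDate"}}


def build_query(table, slots):
    """Turn filled slots into a parameterized SELECT statement."""
    if table not in ALLOWED:
        raise ValueError(f"unknown table: {table}")
    # Keep only known columns so identifiers never come from user text.
    columns = [name for name in slots if name in ALLOWED[table]]
    sql = f"SELECT * FROM {table}"
    if columns:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in columns)
    return sql, [slots[name] for name in columns]


# Tiny in-memory database with made-up rows to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight (departCity TEXT, arriveCity TEXT, departDate TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO flight VALUES (?, ?, ?, ?)",
    [
        ("Beijing", "Shanghai", "2018-01-01", 500.0),
        ("Beijing", "Guangzhou", "2018-01-01", 800.0),
    ],
)
sql, params = build_query("flight", {"departCity": "Beijing", "arriveCity": "Shanghai"})
rows = conn.execute(sql, params).fetchall()
```

An empty `rows` list would correspond to the failure case in the recommendation state, where the system falls back to relaxing constraints.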


Because large-scale dialogue corpora are lacking in these domains, we generate utterances with
template-based NLG, which means we need to enumerate every combination of slot states in order to
predefine the dialogue templates. Once the user’s dialogue action is found among the predefined sentence
templates, we fill the template’s slots with values from the user’s history. One advantage is that this
ensures the controllability of the responses given to users.
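Template-based generation of this kind amounts to a lookup followed by slot substitution. The sketch below illustrates the idea; the action names and template wording are invented for illustration and are not the templates used in the system.

```python
# Hypothetical templates keyed by dialogue action.
TEMPLATES = {
    "request_slot": "Which {slot} would you like?",
    "recommend": "I found {count} option(s) from {departCity} to {arriveCity}.",
    "no_result": "Sorry, nothing matches your request from {departCity} to {arriveCity}.",
}


def generate(action, **values):
    """Fill the template for a dialogue action with slot values."""
    if action not in TEMPLATES:
        raise KeyError(f"no template for action: {action}")
    return TEMPLATES[action].format(**values)


reply = generate("recommend", count=2, departCity="Beijing", arriveCity="Shanghai")
```

Because every response comes from a fixed template, the system cannot produce an unexpected sentence, which is the controllability advantage mentioned above. The cost is that each new dialogue situation needs its own template.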


4. EXPERIMENTS 


In the Chinese Human-Computer Dialogue Technology Evaluation Campaign, the task is to develop a
task-oriented dialogue system that helps users book flights, trains and hotels.


4.1 Data Sets


Since only three databases are provided, we extend the data set of Task 1 for domain classification and 
rule extraction. We also annotate a 300-dialogue corpus (about 1,500 sentences for training) with slot labels 
for evaluating LSTM-CRF and BLSTM-CRF models. Table 1 and Table 2 show the details of the data set. 


Table 1. The statistics of training corpus for domain classification. 


            Train   Flight   Hotel   Others
#sentence   533     510      512     588


Table 2. The statistics of training corpus for slot filling. 


#label (Train)   #label (Flight)   #label (Hotel)   #sentence   Average sentence length (words)
13               15                5                1,013       5.86


4.2 Evaluation 


For slot filling in NLU, we adopt the entity-level F1 score commonly used in named entity recognition.
Dialogue evaluation, however, remains a difficult task. We use the evaluation metrics of the
Chinese Human-Computer Dialogue Technology Evaluation Campaign, including task completion rate, user
satisfaction score, dialogue naturalness, number of turns and robustness to uncovered cases.
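The entity-level F1 score counts a predicted slot as correct only when both its label and its span match the gold annotation exactly. A minimal sketch follows, assuming entities are represented as `(slot, start, end)` tuples collected per sentence; this representation and the function name are assumptions.

```python
def entity_f1(gold, predicted):
    """Micro-averaged entity-level F1.

    gold and predicted are lists with one set of (slot, start, end)
    tuples per sentence; an entity counts as correct only on exact match.
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in predicted)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [{("date", 0, 1), ("departCity", 1, 2)}, {("arriveCity", 3, 4)}]
pred = [{("date", 0, 1)}, {("arriveCity", 3, 4), ("date", 4, 5)}]
score = entity_f1(gold, pred)
```

In this toy example two of three predicted entities are correct and two of three gold entities are found, so precision, recall and F1 all equal 2/3.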


5. DISCUSSION 


Table 3 shows the results of slot filling. We compare the performance of LSTM-CRF and BLSTM-CRF
with unigram features and with unigram plus bigram features. With unigram features, the score increases
by 1.7 percentage points (from 85.93% to 87.63%) when word order is considered in both directions with
the BLSTM. The bigram feature helps LSTM-CRF (86.20% versus 85.93%), but it hurts BLSTM-CRF (87.21%
versus 87.63%), as the average length of the bigram sequence is short. One possible remedy is to use
character embeddings instead of word embeddings.


Table 3. Comparison of labeling performance on NLU. 


Feature LSTM-CRF BLSTM-CRF 
Unigram 85.93% 87.63% 
Unigram + bigram 86.20% 87.21% 


Table 4 shows the performance of our system on the evaluation metrics described above.
Most of the metrics are annotated manually, except for the average number of dialogue turns. Our system
obtained the best scores on user satisfaction, dialogue naturalness and boot ability due to the
reasonable dialogue intent transition template we predefined. However, this leads to a decline in task
performance, especially when the user intent is not identified or an important slot value is not
extracted correctly.


Table 4. Dialogue quality evaluation results for the top 4 teams in the competition.


Metric              ShenSiKao   PuTaoWeiDu   MDKB-BOT   ChuMenWenWen
Completion rate     31.75       19.05        19.05      TETI
#turn               64.53       72.28        78.72      71.39
User satisfaction   0           -1           0          -2
Naturalness         -1          1            1          z
Boot ability        2           3            3          3


6. CONCLUSION 


In this paper, we proposed a simple but practical framework for multi-domain task-oriented dialogue
systems. Our model leverages both neural networks and rule-based strategies to handle the domain
transition problem. It achieves competitive results in the Chinese Human-Computer Dialogue Technology
Evaluation Campaign, especially on the user-friendliness and utterance guidance metrics. In future work,
we plan to apply end-to-end neural networks to NLG based on the information extracted and maintained in
NLU and DM to improve system performance.